Automating Exploratory Data Analysis

Paul M Washburn

May 6, 2019

Machine learning practitioners first need to identify signal in their datasets before building models.

Iteration cycle time matters in the development of machine learning solutions. This work is a first attempt at accelerating this cycle time for a range of dataset types.

It is assumed the user is inside a Jupyter Notebook REPL environment.

Towards Automated Exploration

The primary goal of auto-explore is to establish a codebase that reduces the effort required to produce a reasonable first-pass exploratory data analysis for a variety of dataset types.

This Python library is a first attempt at automating the process of exploratory data analysis – at least as far as computation and visualization are concerned.

Critical thinking is not included.

Potential Benefits of Semi-automated EDA

  • Faster time to insights & modeling
  • Shorter exploratory data analysis turnaround
  • Reliable processes that are vetted & improved over time
  • No need to re-configure old code for new situations
  • Supplies a base for more in-depth analysis

Previous Work in the Space

Overview

Simply specify a dataset and a few attributes.

Much of the functionality of this library is to generate visualizations that are useful in understanding data. However, there is a great deal of analytical functionality included as well, including some lightweight machine learning.

Example visualizations and the functions that produce them:

  • Linear Regression Analysis: lmplot
  • Text Analysis: tsne
  • Time Series Analysis: plot_tseries_over_group_with_histograms
  • Correlation Analysis: correlation_heatmap, scatterplotmatrix
  • Categorical Analysis: target_distribution_over_binary_groups
  • Clustering Analysis: cluster_and_plot_pca1, cluster_and_plot_pca2, elbow
  • Helper Functions
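To give a flavor of the computation behind a function like correlation_heatmap, the sketch below builds a pairwise Pearson correlation matrix using only the standard library. This is an illustrative stand-in with made-up data, not the library's actual implementation:

```python
# A toy dataset with columns as name -> list of values (hypothetical data).
data = {
    "age":    [23, 31, 44, 52, 60],
    "income": [30_000, 42_000, 55_000, 61_000, 72_000],
    "debt":   [12_000, 9_000, 7_500, 4_000, 2_000],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise correlations for a dict of equal-length numeric columns."""
    names = list(columns)
    return {
        a: {b: round(pearson(columns[a], columns[b]), 3) for b in names}
        for a in names
    }

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, row)
```

In practice the library renders this kind of matrix as a heatmap rather than printing it, since color makes strong and weak relationships easier to spot at a glance.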

Functionality & Use

Using the AutopilotExploratoryAnalysis object makes many methods available on your data with minimal setup.

However each module in this library can be used separately without ever instantiating this object.

Streamlining functionality into this object is still in the alpha stage and is by no means perfect. True automation is still a ways off.

eda.py: Interface to Semi-Automation

Simply specify a DataFrame and a list for each of its binary, categorical, numerical, and text columns. If applicable, set target_col to a single-element list containing the target column's name.
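Assembling those column lists by hand can be tedious, and a rough first pass can be inferred from the data itself. The helper below is a hypothetical sketch (not part of the library's API) that partitions the columns of a list-of-dicts dataset into binary, categorical, numerical, and text groups using simple heuristics:

```python
def partition_columns(records, max_categories=20):
    """Split column names into binary / categorical / numerical / text
    buckets by inspecting the values in a list-of-dicts dataset.
    Heuristics are deliberately simple: two distinct values means binary,
    all-numeric means numerical, long strings mean free text."""
    groups = {"binary": [], "categorical": [], "numerical": [], "text": []}
    for col in records[0]:
        values = [r[col] for r in records if r[col] is not None]
        distinct = set(values)
        if len(distinct) == 2:
            groups["binary"].append(col)
        elif all(isinstance(v, (int, float)) for v in values):
            groups["numerical"].append(col)
        elif all(isinstance(v, str) for v in values) and (
            sum(len(v.split()) for v in values) / len(values) >= 3
        ):
            # Strings averaging several words are treated as free text.
            groups["text"].append(col)
        elif len(distinct) <= max_categories:
            groups["categorical"].append(col)
        else:
            groups["text"].append(col)
    return groups

# Hypothetical customer records for illustration.
records = [
    {"churned": 0, "plan": "basic", "spend": 10.0, "notes": "called support twice"},
    {"churned": 1, "plan": "pro", "spend": 45.5, "notes": "asked about refunds"},
    {"churned": 0, "plan": "enterprise", "spend": 12.0, "notes": "happy with service"},
]
print(partition_columns(records))
```

Heuristics like these are only a starting point; the resulting lists should be reviewed before being passed to the object, since edge cases (numeric category codes, short text fields) are easy to misclassify.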

This object makes 17 methods available for use on the df supplied as an argument. Check out the most recent code on GitHub.

Other Modules

  • viz.py - Contains all the visualization functions useful in EDA
  • featexp.py - A copy (as of April 2019) of the featexp main package code
  • apis.py - Code that fetches data and machine learning models from various sources
  • notebooks.py - Formatting code for inside a Jupyter Notebook REPL environment
  • stats.py - Currently houses only best_theoretical_distribution, but this will expand
  • datetime.py - Houses code pertaining to time-series features (e.g. make_calendars)
  • diligence.py - Houses code that performs sanity checks of various sorts
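As an example of the kind of check diligence.py is meant to house, a basic sanity pass might flag columns that carry no usable signal. The function below is an illustrative sketch, not the module's actual API:

```python
def sanity_check(records):
    """Flag columns in a list-of-dicts dataset that carry no signal:
    entirely missing, or constant across all rows."""
    issues = {}
    for col in records[0]:
        values = [r.get(col) for r in records]
        present = [v for v in values if v is not None]
        if not present:
            issues[col] = "all values missing"
        elif len(set(present)) == 1:
            issues[col] = "constant column"
    return issues

# Hypothetical rows: one constant column and one entirely-missing column.
rows = [
    {"id": 1, "country": "US", "legacy_flag": None},
    {"id": 2, "country": "US", "legacy_flag": None},
    {"id": 3, "country": "US", "legacy_flag": None},
]
print(sanity_check(rows))
```

Catching dead columns like these early keeps them out of downstream visualizations and models, which is exactly the kind of vetted, reusable check this module is intended to accumulate.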

Future Work

The goal for this library is to automate as much of the EDA process as possible for as wide a range of dataset types as possible.

The library is not quite there yet.

Until then, efforts will be made to abstract as many EDA tasks as possible into this library, eventually culminating in a full_suite_report mechanism.